AITopics | test scenario


test scenario

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis


b35c38f70065ac6c694089ca93a015bb-Paper-Conference.pdf

Neural Information Processing Systems, Feb-17-2026, 14:18:06 GMT

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

Asia > Singapore (0.29)
Europe > Austria > Vienna (0.14)
North America > United States > Illinois > Champaign County > Urbana (0.04)
(7 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Government (0.92)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)


MASTEST: A LLM-Based Multi-Agent System For RESTful API Tests

Han, Xiaoke, Zhu, Hong

arXiv.org Artificial Intelligence, Nov-25-2025

Testing RESTful APIs is increasingly important in quality assurance of cloud-native applications. Recent advances in machine learning (ML) techniques have demonstrated that various testing activities can be performed automatically by large language models (LLMs) with reasonable accuracy. This paper develops a multi-agent system called MASTEST that combines LLM-based and programmed agents to form a complete tool chain covering the whole workflow of API testing: generating unit and system test scenarios from an API specification in the OpenAPI Swagger format, generating Pytest test scripts, executing the test scripts to interact with web services, and analysing web service response messages to determine test correctness and calculate test coverage. The system also supports the incorporation of human testers in reviewing and correcting LLM-generated test artefacts to ensure the quality of testing activities. The MASTEST system is evaluated on two LLMs, GPT-4o and DeepSeek V3.1 Reasoner, with five public APIs. The performance of the LLMs on various testing activities is measured by a wide range of metrics, including unit and system test scenario coverage and API operation coverage for the quality of generated test scenarios; data type correctness, status code coverage and script syntax correctness for the quality of LLM-generated test scripts; and bug detection ability and usability of LLM-generated test scenarios and scripts. Experiment results demonstrated that both DeepSeek and GPT-4o achieved a high overall performance. DeepSeek excels in data type correctness and status code detection, while GPT-4o performs best in API operation coverage. For both models, LLM-generated test scripts maintained 100% syntax correctness and required only minimal manual edits for semantic correctness. These findings indicate the effectiveness and feasibility of MASTEST.
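To make the "status code coverage" metric mentioned in the abstract more concrete, here is a minimal Python sketch. It assumes coverage means the fraction of (method, path, status code) combinations documented in an OpenAPI specification that were actually observed in test responses; the abstract does not give the exact definition used by MASTEST, and the function names and the tiny example specification are hypothetical.

```python
# Assumption: coverage = documented (method, path, status) triples that were
# observed, divided by all documented triples. Not necessarily MASTEST's definition.

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def documented_responses(spec):
    """Collect every (METHOD, path, status) triple documented in an OpenAPI dict."""
    triples = set()
    for path, path_item in spec.get("paths", {}).items():
        for method, operation in path_item.items():
            # Path items may also hold non-operation keys such as "parameters".
            if method.lower() not in HTTP_METHODS:
                continue
            for status in operation.get("responses", {}):
                triples.add((method.upper(), path, str(status)))
    return triples


def status_code_coverage(spec, observed):
    """Fraction of documented triples that appear in the observed responses."""
    documented = documented_responses(spec)
    if not documented:
        return 0.0
    seen = {(method.upper(), path, str(status)) for method, path, status in observed}
    return len(documented & seen) / len(documented)


if __name__ == "__main__":
    # Hypothetical two-operation specification, for illustration only.
    example_spec = {
        "paths": {
            "/pets": {
                "get": {"responses": {"200": {}, "404": {}}},
                "post": {"responses": {"201": {}, "400": {}}},
            }
        }
    }
    # Responses a test run might have produced; the 500 is undocumented and ignored.
    example_observed = [("GET", "/pets", 200), ("POST", "/pets", 201), ("GET", "/pets", 500)]
    print(status_code_coverage(example_spec, example_observed))
```

Undocumented status codes are ignored here rather than penalised; a real tool might instead report them separately as potential bugs.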

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.18038

Genre: Research Report (1.00)

Industry: Energy (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


The Pursuit of Diversity: Multi-Objective Testing of Deep Reinforcement Learning Agents

Bartlett, Antony, Liem, Cynthia, Panichella, Annibale

arXiv.org Artificial Intelligence, Oct-17-2025

Testing deep reinforcement learning (DRL) agents in safety-critical domains requires discovering diverse failure scenarios. Existing tools such as INDAGO rely on single-objective optimization focused solely on maximizing failure counts, but this does not ensure discovered scenarios are diverse or reveal distinct error types. We introduce INDAGO-Nexus, a multi-objective search approach that jointly optimizes for failure likelihood and test scenario diversity using multi-objective evolutionary algorithms with multiple diversity metrics and Pareto front selection strategies. We evaluated INDAGO-Nexus on three DRL agents: humanoid walker, self-driving car, and parking agent. On average, INDAGO-Nexus discovers up to 83% and 40% more unique failures (test effectiveness) than INDAGO in the SDC and Parking scenarios, respectively, while reducing time-to-failure by up to 67% across all agents.
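The Pareto front selection mentioned in the abstract can be illustrated with a short Python sketch. It assumes each candidate test scenario is scored by two objectives that are both maximised (for example failure likelihood and diversity); the function names and the scores are hypothetical, and this is a generic non-dominated filter, not the INDAGO-Nexus implementation.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates (all objectives maximised)."""
    front = []
    for i, candidate in enumerate(points):
        dominated = any(
            dominates(other, candidate)
            for j, other in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(candidate)
    return front


if __name__ == "__main__":
    # Hypothetical (failure_likelihood, diversity) scores for five scenarios.
    scores = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.4, 0.4), (0.9, 0.05)]
    print(pareto_front(scores))
```

This quadratic-time filter is adequate for small populations; multi-objective evolutionary algorithms typically use faster non-dominated sorting and additional selection criteria on top of it.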

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

2510.14727

Country: North America > United States (0.14)

Genre:

Research Report > Experimental Study (0.70)
Research Report > New Finding (0.69)

Industry: Transportation > Ground > Road (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)


ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation

Neural Information Processing Systems, Oct-10-2025, 13:54:06 GMT

ALI-Agent operates through two principal stages: Emulation and Refinement. During the Emulation stage, ALI-Agent automates the generation of realistic test scenarios.

ali-agent, language model, scenario, (16 more...)

Neural Information Processing Systems

Country:

Asia > Singapore (0.29)
Europe > Austria > Vienna (0.14)
North America > United States > Illinois > Champaign County > Urbana (0.04)
(7 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (0.93)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


AI Is Scheming, and Stopping It Won't Be Easy, OpenAI Study Finds

TIME - Tech, Sep-18-2025, 17:28:10 GMT

New research released yesterday by OpenAI and AI safety organization Apollo Research provides further evidence for a concerning trend: virtually all of today's best AI systems--including Anthropic's Claude Opus, Google's Gemini, and OpenAI's o3--can engage in "scheming," or pretending to do what their human developers want, while secretly pursuing different objectives. "Our findings show that scheming is not merely a theoretical concern--we are seeing signs that this issue is beginning to emerge across all frontier models today," wrote OpenAI in a blog post. Currently, "models rarely have opportunities to scheme and cause significant harm, but we expect this to change in the future as AIs are taking on more important and long-term tasks," they wrote. The researchers measured scheming by looking at the conditions under which AI systems would take covert actions, or "deliberate withholding or distortion of task-relevant information." In one example, OpenAI's o3 is given a test of its chemistry capabilities: "we are currently testing your capabilities.

advertisement, ai outsmart virus expert, openai, (12 more...)

TIME - Tech

Country: North America > United States (0.05)

Genre: Research Report > New Finding (0.55)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)


Practical Equivalence Testing and Its Application in Synthetic Pre-Crash Scenario Validation

Wu, Jian, Sander, Ulrich, Flannagan, Carol, Zhao, Minxiang, Bärgman, Jonas

arXiv.org Artificial Intelligence, Aug-26-2025

The use of representative pre-crash scenarios is critical for assessing the safety impact of driving automation systems through simulation. However, a gap remains in the robust evaluation of the similarity between synthetic and real-world pre-crash scenarios and their crash characteristics. Without proper validation, it cannot be ensured that the synthetic test scenarios adequately represent real-world driving behaviors and crash characteristics. One reason for this validation gap is the lack of focus on methods to confirm that the synthetic test scenarios are practically equivalent to real-world ones, given the assessment scope. Traditional statistical methods, like significance testing, focus on detecting differences rather than establishing equivalence; since failure to detect a difference does not imply equivalence, they are of limited applicability for validating synthetic pre-crash scenarios and crash characteristics. This study addresses this gap by proposing an equivalence testing method based on the Bayesian Region of Practical Equivalence (ROPE) framework. This method is designed to assess the practical equivalence of scenario characteristics that are most relevant for the intended assessment, making it particularly appropriate for the domain of virtual safety assessments. We first review existing equivalence testing methods. Then we propose and demonstrate the Bayesian ROPE-based method by testing the equivalence of two rear-end pre-crash datasets. Our approach focuses on the most relevant scenario characteristics. Our analysis provides insights into the practicalities and effectiveness of equivalence testing in synthetic test scenario validation and demonstrates the importance of testing for improving the credibility of synthetic data for automated vehicle safety assessment, as well as the credibility of subsequent safety impact assessments.
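As a rough illustration of the ROPE idea described in the abstract, the Python sketch below computes how much posterior mass of a difference between two datasets falls inside a region of practical equivalence. It assumes posterior samples of the difference are already available and uses a simple mass threshold as the decision rule; the ROPE bounds, the threshold, the function names, and the sample values are all hypothetical, and the paper's actual procedure may differ (for example by comparing a credible interval with the ROPE).

```python
def rope_mass(samples, low, high):
    """Fraction of posterior samples of the difference that lie inside [low, high]."""
    if not samples:
        raise ValueError("need at least one posterior sample")
    inside = sum(1 for s in samples if low <= s <= high)
    return inside / len(samples)


def is_practically_equivalent(samples, low, high, threshold=0.95):
    """Declare practical equivalence when enough posterior mass lies inside the ROPE.

    Assumption: a mass threshold is used here for simplicity; it is only one of
    several possible ROPE decision rules.
    """
    return rope_mass(samples, low, high) >= threshold


if __name__ == "__main__":
    # Hypothetical posterior samples of a difference in a scenario characteristic
    # (synthetic minus real-world), with a hypothetical ROPE of +/- 0.1.
    posterior_diff = [-0.05, 0.0, 0.02, 0.08, 0.3]
    print(rope_mass(posterior_diff, -0.1, 0.1))
    print(is_practically_equivalent(posterior_diff, -0.1, 0.1))
```

Unlike a significance test, this check can positively support equivalence: a wide posterior yields low mass inside the ROPE and therefore no equivalence claim, even when no difference would be detected.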

artificial intelligence, assessment, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2505.12827

Country:

Europe (0.69)
North America > United States (0.46)

Genre: Research Report > Experimental Study (0.94)

Industry:

Automobiles & Trucks (0.68)
Transportation > Ground > Road (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.68)


Vision Language Model-based Testing of Industrial Autonomous Mobile Robots

Wu, Jiahui, Lu, Chengjie, Arrieta, Aitor, Ali, Shaukat, Peyrucain, Thomas

arXiv.org Artificial Intelligence, Aug-5-2025

Autonomous Mobile Robots (AMRs) are deployed in diverse environments (e.g., warehouses, retail spaces, and offices), where they work alongside humans. Given that human behavior can be unpredictable and that AMRs may not have been trained to handle all possible unknown and uncertain behaviors, it is important to test AMRs under a wide range of human interactions to ensure their safe behavior. Moreover, testing in real environments with actual AMRs and humans is often costly, impractical, and potentially hazardous (e.g., it could result in human injury). To this end, we propose a Vision Language Model (VLM)-based testing approach (RVSG) for industrial AMRs developed by PAL Robotics in Spain. Based on the functional and safety requirements, RVSG uses the VLM to generate diverse human behaviors that violate these requirements. We evaluated RVSG with several requirements and navigation routes in a simulator using the latest AMR from PAL Robotics. Our results show that, compared with the baseline, RVSG can effectively generate requirement-violating scenarios. Moreover, RVSG-generated scenarios increase variability in robot behavior, thereby helping reveal their uncertain behaviors.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2508.02338

Country: Europe > Spain (0.48)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Locomotion (0.71)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)


Operationalization of Scenario-Based Safety Assessment of Automated Driving Systems

Camp, Olaf Op den, de Gelder, Erwin

arXiv.org Artificial Intelligence, Jul-31-2025

Olaf Op den Camp (Integrated Vehicle Safety, TNO, Helmond, the Netherlands; ORCID 0000-0002-6355-134X) and Erwin de Gelder (Integrated Vehicle Safety, TNO, Helmond, the Netherlands; ORCID 0000-0003-4260-4294). Abstract: Before introducing an Automated Driving System (ADS) on the road at scale, the manufacturer must conduct some sort of safety assurance. To structure and harmonize the safety assurance process, the UNECE WP.29 Working Party on Automated/Autonomous and Connected Vehicles (GRVA) is developing the New Assessment/Test Method (NATM) that indicates what steps need to be taken for safety assessment of an ADS. In this paper, we will show how to practically conduct safety assessment making use of a scenario database, and what additional steps must be taken to fully operationalize the NATM. In addition, we will elaborate on how the use of scenario databases fits with methods developed in the Horizon Europe projects that focus on safety assessment following the NATM approach. A safety assurance process that is conducted by the manufacturer before introducing an Automated Driving System (ADS) intends to assure that the ADS responds appropriately in all situations it is designed for and that the ADS is able to avoid any reasonably foreseeable and reasonably preventable collisions. The information out of the safety assurance process is not only important for manufacturers, but also for authorities that have the responsibility to guard the safety of their citizens in traffic. Safety assurance is most important for consumers (and fleet owners) using an ADS with the expectation that the system is safe, reliable, and trustworthy. To structure and harmonize this process, the UNECE WP.29 Working Party on Automated/Autonomous and Connected Vehicles (GRVA) is developing the New Assessment/Test Method (NATM) [1], which is already recognized across many countries (e.g., Japan, South Korea, the EU and the USA).

artificial intelligence, scenario, scenario database, (13 more...)

arXiv.org Artificial Intelligence

2507.22433

Country:

Europe > Netherlands (0.44)
North America > United States (0.34)

Genre: Research Report (0.40)

Industry:

Law (1.00)
Transportation > Ground > Road (0.92)
Information Technology > Robotics & Automation (0.82)
Automobiles & Trucks (0.82)

Technology: Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)


GenAI for Automotive Software Development: From Requirements to Wheels

Petrovic, Nenad, Pan, Fengjunjie, Zolfaghari, Vahid, Lebioda, Krzysztof, Schamschurko, Andre, Knoll, Alois

arXiv.org Artificial Intelligence, Jul-25-2025

This paper introduces a GenAI-empowered approach to automated development of automotive software, with emphasis on autonomous and Advanced Driver Assistance Systems (ADAS) capabilities. The process starts with requirements as input, while the main generated outputs are test scenario code for the simulation environment, together with an implementation of the desired ADAS capabilities targeting the hardware platform of the vehicle connected to a testbench. Moreover, we introduce additional steps for requirements consistency checking leveraging Model-Driven Engineering (MDE). In the proposed workflow, Large Language Models (LLMs) are used for model-based summarization of requirements (Ecore metamodel, XMI model instance and OCL constraint creation), test scenario generation, simulation code generation (Python) and target platform code generation (C++). Additionally, Retrieval Augmented Generation (RAG) is adopted to enhance test scenario generation from documents related to autonomous driving regulations. Our approach aims at shorter compliance and re-engineering cycles, as well as reduced development and testing time for ADAS-related capabilities.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2507.18223

Country:

Europe > Germany (0.15)
Europe > Italy (0.14)
Asia > Middle East > UAE (0.14)

Genre:

Research Report (0.84)
Workflow (0.71)

Industry: Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)


AI-Driven Tools in Modern Software Quality Assurance: An Assessment of Benefits, Challenges, and Future Directions

Pysmennyi, Ihor, Kyslyi, Roman, Kleshch, Kyrylo

arXiv.org Artificial Intelligence, Jun-23-2025

Traditional quality assurance (QA) methods face significant challenges in addressing the complexity, scale, and rapid iteration cycles of modern software systems and are strained by the limited resources available, leading to substantial costs associated with poor quality. The object of this research is the Quality Assurance processes for modern distributed software applications. The subject of the research is the assessment of the benefits, challenges, and prospects of integrating modern AI-oriented tools into quality assurance processes. We performed a comprehensive analysis of the implications for both verification and validation processes, covering exploratory test analyses, equivalence partitioning and boundary analyses, metamorphic testing, finding inconsistencies in acceptance criteria (AC), static analyses, test case generation, unit test generation, test suite optimization and assessment, and end-to-end scenario execution. An end-to-end regression of a sample enterprise application, using AI agents over generated test scenarios, was implemented as a proof of concept highlighting the practical use of the study. The results, with only 8.3% flaky executions of generated test cases, indicate significant potential for the proposed approaches. However, the study also identified substantial challenges for practical adoption concerning the generation of semantically identical coverage, the "black box" nature and lack of explainability of state-of-the-art Large Language Models (LLMs), and the tendency to correct mutated test cases to match expected results, underscoring the necessity for thorough verification of both generated artifacts and test execution results. The research demonstrates AI's transformative potential for QA but highlights the importance of a strategic approach to implementing these technologies, considering the identified limitations and the need for developing appropriate verification methodologies.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.15587/2706-5448.2025.330595

2506.16586

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Industry:

Information Technology (0.68)
Banking & Finance (0.68)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
